179 research outputs found
Towards the automated analysis of simple polyphonic music: a knowledge-based approach
Music understanding is a process closely related to the knowledge and experience
of the listener. The amount of knowledge required depends on the
complexity of the task at hand.
This dissertation is concerned with the problem of automatically decomposing
musical signals into a score-like representation. It proposes that, as
with humans, an automatic system requires knowledge about the signal and
its expected behaviour to correctly analyse music.
The proposed system uses a blackboard architecture to combine top-down
knowledge with data produced by bottom-up processing of the
signal. Methods are proposed for the estimation of pitches,
onset times and durations of notes in simple polyphonic music.
A method for onset detection is presented. It provides an alternative to
conventional energy-based algorithms by using phase information. Statistical
analysis is used to create a detection function that measures how far the
signal departs from its expected behaviour around onsets.
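As a rough illustration of the phase-based idea (a minimal sketch, not the dissertation's own code: the function name, parameters, and use of librosa are assumptions):

```python
import numpy as np
import librosa

def phase_deviation_onset_curve(y, n_fft=1024, hop=512):
    """Phase-based onset detection function: a steady partial has a
    near-constant frame-to-frame phase increment, so large deviations
    from that expected behaviour suggest a note onset."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    phase = np.angle(S)
    # Second difference of phase per frequency bin approximates the
    # deviation from locally linear phase progression.
    dev = np.diff(phase, n=2, axis=1)
    # Wrap deviations to (-pi, pi] before measuring their size.
    dev = np.angle(np.exp(1j * dev))
    # Average the absolute deviation across bins: one value per frame.
    return np.mean(np.abs(dev), axis=0)
```

Peaks in the returned curve, picked against an adaptive threshold, would then serve as onset candidates.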
Two methods for multi-pitch estimation are introduced. The first concentrates
on the grouping of harmonic information in the frequency-domain.
Its performance and limitations emphasise the case for the use of high-level
knowledge.
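One common way to realize such frequency-domain harmonic grouping, sketched here under assumed names (`harmonic_salience`, the `f0_grid` of candidate fundamentals) rather than as the dissertation's actual algorithm:

```python
import numpy as np
import librosa

def harmonic_salience(y, sr, f0_grid, n_fft=4096, n_harm=8):
    """Score each candidate fundamental frequency by the average
    spectral magnitude found at its first n_harm harmonics."""
    mag = np.abs(librosa.stft(y, n_fft=n_fft)).mean(axis=1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    scores = []
    for f0 in f0_grid:
        harmonics = f0 * np.arange(1, n_harm + 1)
        bins = np.searchsorted(freqs, harmonics)
        bins = bins[bins < len(mag)]  # discard harmonics above Nyquist
        scores.append(mag[bins].mean())
    return np.array(scores)  # peaks suggest simultaneous pitches
```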
This high-level knowledge, in the form of the individual waveforms of a single
instrument, is used in the second proposed approach. The method is based
on a time-domain linear additive model and it presents an alternative to
common frequency-domain approaches.
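The additive idea can be sketched as follows, assuming a dictionary `W` whose columns hold the stored note waveforms of one instrument; the non-negative least-squares solver is an illustrative choice, not necessarily the dissertation's:

```python
from scipy.optimize import nnls

def additive_decomposition(x, W):
    """Approximate a signal segment x as a non-negative linear
    combination of stored note waveforms (the columns of W):
        x ~= W @ a,  a >= 0.
    Large entries of a mark notes judged to be present."""
    a, resid = nnls(W, x)
    return a, resid
```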
Results are presented and discussed for all methods, showing that, if
reliably generated, the use of knowledge can significantly improve the quality
of the analysis.
Funding: Joint Information Systems Committee (JISC) in the UK; National Science Foundation (NSF) in the United States; Fundación Gran Mariscal de Ayacucho in Venezuela.
Robust sound event detection in bioacoustic sensor networks
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs),
can record sounds of wildlife over long periods of time in scalable and
minimally invasive ways. Deriving per-species abundance estimates from these
sensors requires detection, classification, and quantification of animal
vocalizations as individual acoustic events. Yet, variability in ambient noise,
both over time and across sensors, hinders the reliability of current automated
systems for sound event detection (SED), such as convolutional neural networks
(CNNs) in the time-frequency domain. In this article, we develop, benchmark, and
combine several machine listening techniques to improve the generalizability of
SED models across heterogeneous acoustic environments. As a case study, we
consider the problem of detecting avian flight calls from a ten-hour recording
of nocturnal bird migration, recorded by a network of six ARUs in the presence
of heterogeneous background noise. Starting from a CNN yielding
state-of-the-art accuracy on this task, we introduce two noise adaptation
techniques, respectively integrating short-term (60 milliseconds) and long-term
(30 minutes) context. First, we apply per-channel energy normalization (PCEN)
in the time-frequency domain, which applies short-term automatic gain control
to every subband in the mel-frequency spectrogram. Second, we replace the
last dense layer in the network with a context-adaptive neural network (CA-NN)
layer. Combining the two yields state-of-the-art results that are unmatched by
artificial data augmentation alone. We release a pre-trained version of our
best performing system under the name of BirdVoxDetect, a ready-to-use detector
of avian flight calls in field recordings.
Comment: 32 pages, in English. Submitted to the PLOS ONE journal in February 2019; revised August 2019; published October 2019.
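PCEN itself has a standard closed form with a reference implementation in librosa; a minimal sketch of applying it to a mel spectrogram (the file name and parameter values are illustrative defaults, not necessarily those of BirdVoxDetect):

```python
import librosa

# Hypothetical field recording; any mono file will do.
y, sr = librosa.load("field_recording.wav", sr=22050)

# Mel spectrogram of linear magnitude (PCEN expects non-log energy).
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                   hop_length=512, power=1.0)

# Per-channel energy normalization: an IIR-smoothed energy estimate M
# provides short-term automatic gain control in each mel subband,
#   PCEN = (S / (eps + M)**gain + bias)**power - bias**power.
# The 2**31 rescaling follows the librosa documentation's example.
S_pcen = librosa.pcen(S * (2**31), sr=sr, hop_length=512)
```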
Sound Source Distance Estimation in Diverse and Dynamic Acoustic Conditions
Localizing a moving sound source in the real world involves determining its
direction-of-arrival (DOA) and distance relative to a microphone. Advancements
in DOA estimation have been facilitated by data-driven methods optimized on
large open-source datasets of microphone array recordings in diverse
environments. In contrast, estimating a sound source's distance remains
understudied. Existing approaches assume recordings from non-coincident
microphones and rely on methods that are sensitive to differences in room
reverberation. We present a convolutional recurrent neural network (CRNN) able
to estimate the distance of moving sound
sources across multiple datasets featuring diverse rooms, outperforming a
recently-published approach. We also characterize our model's performance as a
function of sound source distance and of different training losses. This
analysis shows that training works best with a loss that weights model errors
in inverse proportion to the true source distance. Our study is the first to
demonstrate that sound source distance estimation can be performed across
diverse acoustic conditions using deep learning.
Comment: Accepted at WASPAA 2023.
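A plausible form of such a distance-weighted loss, written here as a generic sketch rather than the paper's exact formulation:

```python
import torch

def inverse_distance_weighted_l1(d_pred, d_true, eps=1.0):
    """L1 distance error down-weighted for far-away sources, so that
    errors on nearby sources dominate training. eps (in the same
    units as d_true) guards against division by tiny distances."""
    return torch.mean(torch.abs(d_pred - d_true) / (d_true + eps))
```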
Attitudes toward statistics among psychology, engineering, and economics students at the Universidad Católica de Colombia, semester 2017-1
The attitude scale developed by Estrada (2002) and validated for its psychometric properties in 2012 was adapted to the language of the university's students and administered to 588 students in the Economics, Engineering, and Psychology programs. The scale was then re-validated on the resulting dataset using both classical test theory and item response theory. No differences in attitudes were found by degree program, gender, or class schedule, although attitudes were somewhat more favorable among younger students than older ones. The multivariate analysis identified four principal components.
Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
Finding the right sound effects (SFX) to match moments in a video is a
difficult and time-consuming task, and relies heavily on the quality and
completeness of text metadata. Retrieving high-quality (HQ) SFX using a video
frame directly as the query is an attractive alternative, removing the reliance
on text metadata and providing a low barrier to entry for non-experts. Due to
the lack of HQ audio-visual training data, previous work on audio-visual
retrieval relies on YouTube (in-the-wild) videos of varied quality for
training, where the audio is often noisy and the video of amateur quality. As
such, it is unclear whether these systems would generalize to the task of
matching HQ audio to production-quality video. To address this, we propose a
multimodal framework for recommending HQ SFX given a video frame by (1)
leveraging large language models and foundational vision-language models to
bridge HQ audio and video to create audio-visual pairs, resulting in a highly
scalable automatic audio-visual data curation pipeline; and (2) using
pre-trained audio and visual encoders to train a contrastive learning-based
retrieval system. We show that our system, trained using our automatic data
curation pipeline, significantly outperforms baselines trained on in-the-wild
data on the task of HQ SFX retrieval for video. Furthermore, while the
baselines fail to generalize to this task, our system generalizes well from
clean to in-the-wild data, outperforming the baselines on a dataset of YouTube
videos despite only being trained on the HQ audio-visual pairs. A user study
confirms that people prefer SFX retrieved by our system over the baseline 67%
of the time both for HQ and in-the-wild data. Finally, we present ablations to
determine the impact of model and data pipeline design choices on downstream
retrieval performance. Please visit our project website to listen to and view
our SFX retrieval results.
Comment: WASPAA 2023. Project page: https://juliawilkins.github.io/sound-effects-retrieval-from-video/. 4 pages, 2 figures, 2 tables.
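The contrastive component of such a retrieval system is commonly trained with a symmetric InfoNCE objective over paired embeddings; the sketch below is a generic CLIP-style formulation assumed for illustration, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss: matched audio/image
    pairs share a row index and should be most similar."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature           # (N, N) cosine similarities
    targets = torch.arange(len(a), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```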